# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Fri Dec 04 00:32:09 2020"
I feel that this course will be a challenging experience for me, since I am a beginner in R. I would like to learn something that I can use in my own study and research to improve my working efficiency. I found this course on WebOodi and was attracted by its introduction. The link to my GitHub repository: https://github.com/Liyuan202/IODS-project
1. Read the data with read.table() from "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt"; dim() and str() show the dimensions and structure of the table. The table has 166 observations and 7 variables: "gender", "age", "attitude", "deep", "stra", "surf" and "points". The first six are explanatory variables and the last one (points) is the dependent variable.
lrn14 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", sep = ",", header = TRUE)
dim(lrn14)
## [1] 166 7
str(lrn14)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
2. To use the ggplot2 library, load it with library(ggplot2); note that the package must first be installed with install.packages("ggplot2"). The graphical overview shows that exam points increase with attitude (the fitted model's intercept is about 11.6, see the summary below) and that the points are fairly evenly spread around the fitted line.
library(ggplot2)
p1<-ggplot(lrn14,aes(x=attitude,y=points))
p2<-p1+geom_point()
p2
p3<-p2+geom_smooth(method="lm")
p4<-p3+ggtitle("student's attitude versus exam points")
p4
## `geom_smooth()` using formula 'y ~ x'
3. Summary of the model: the residuals range from -17.52 (Min) to 11.51 (Max). Attitude (p = 4.44e-09) has a statistically significant relationship with points, while the p-values of the other explanatory variables are all above 0.05, which means they have no significant relationship with the target variable points.
my_model <- lm(points~attitude+stra+deep, data =lrn14)
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + deep, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5239 -3.4276 0.5474 3.8220 11.5112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3915 3.4077 3.343 0.00103 **
## attitude 3.5254 0.5683 6.203 4.44e-09 ***
## stra 0.9621 0.5367 1.793 0.07489 .
## deep -0.7492 0.7507 -0.998 0.31974
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared: 0.2097, Adjusted R-squared: 0.195
## F-statistic: 14.33 on 3 and 162 DF, p-value: 2.521e-08
my_model2 <- lm(points~attitude, data =lrn14)
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
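Dropping the non-significant predictors can also be checked formally with a partial F-test via anova(). A minimal sketch of the technique, using the built-in mtcars data as a stand-in (the learning2014 data requires a download); the same call would compare my_model2 against my_model:

```r
# Partial F-test: does the fuller model explain significantly more variance?
# mtcars is a built-in dataset used here purely for illustration.
m_small <- lm(mpg ~ wt, data = mtcars)
m_full  <- lm(mpg ~ wt + hp + qsec, data = mtcars)
anova(m_small, m_full)  # a small p-value means the extra terms improve the fit
```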
4. Regarding model validity, the three diagnostic plots all suggest that the errors of the model are approximately normally distributed.
par(mfrow = c(2,2))
plot(my_model, which = c(1,2,5))
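The Normal Q-Q plot can be complemented with a numeric normality check on the residuals. A sketch using the built-in mtcars data as a stand-in; the same call applies to my_model:

```r
# Shapiro-Wilk test on model residuals as a numeric complement to the QQ plot.
# mtcars stands in for the course data, which requires a download.
m <- lm(mpg ~ wt, data = mtcars)
shapiro.test(resid(m))  # a small p-value would suggest non-normal residuals
```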
It is really challenging for me!!
## Read the data
Description: this is the student alcohol consumption data, with 382 observations and 35 variables. I will select four of these variables for my analysis.
url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc = read.csv(url, sep=",")
dim(alc)
## [1] 382 35
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
Four variables that I selected: age, failures, absences and sex
Hypothesis: age, sex, failures and absences can all predict high alcohol consumption.
Comments: the cross tabulations and charts show the distributions of age, failures, absences and sex for high/low alcohol use. Age 17 has the largest count of high users (35), though the age groups do not differ greatly; males are more likely than females to be high users; most high users have 0 failures and 0 absences, but that largely reflects the size of those groups rather than a protective effect. This partly confirms the hypothesis: as the model shows, only failures, absences and sex have a statistically significant relationship with alcohol consumption.
Cross tabulation
malc <- glm(high_use ~ age + failures + absences + sex, data = alc, family = "binomial")
summary(malc)
##
## Call:
## glm(formula = high_use ~ age + failures + absences + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6644 -0.8051 -0.6026 1.0478 2.1115
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.82153 1.72779 -2.212 0.0270 *
## age 0.11375 0.10394 1.094 0.2738
## failures 0.37915 0.15333 2.473 0.0134 *
## absences 0.07059 0.01795 3.932 8.43e-05 ***
## sexM 0.98875 0.24475 4.040 5.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 417.44 on 377 degrees of freedom
## AIC: 427.44
##
## Number of Fisher Scoring iterations: 4
summary(alc)
## school sex age address
## Length:382 Length:382 Min. :15.00 Length:382
## Class :character Class :character 1st Qu.:16.00 Class :character
## Mode :character Mode :character Median :17.00 Mode :character
## Mean :16.59
## 3rd Qu.:17.00
## Max. :22.00
## famsize Pstatus Medu Fedu
## Length:382 Length:382 Min. :0.000 Min. :0.000
## Class :character Class :character 1st Qu.:2.000 1st Qu.:2.000
## Mode :character Mode :character Median :3.000 Median :3.000
## Mean :2.806 Mean :2.565
## 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :4.000 Max. :4.000
## Mjob Fjob reason nursery
## Length:382 Length:382 Length:382 Length:382
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## internet guardian traveltime studytime
## Length:382 Length:382 Min. :1.000 Min. :1.000
## Class :character Class :character 1st Qu.:1.000 1st Qu.:1.000
## Mode :character Mode :character Median :1.000 Median :2.000
## Mean :1.442 Mean :2.034
## 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :4.000 Max. :4.000
## failures schoolsup famsup paid
## Min. :0.0000 Length:382 Length:382 Length:382
## 1st Qu.:0.0000 Class :character Class :character Class :character
## Median :0.0000 Mode :character Mode :character Mode :character
## Mean :0.2906
## 3rd Qu.:0.0000
## Max. :3.0000
## activities higher romantic famrel
## Length:382 Length:382 Length:382 Min. :1.00
## Class :character Class :character Class :character 1st Qu.:4.00
## Mode :character Mode :character Mode :character Median :4.00
## Mean :3.94
## 3rd Qu.:5.00
## Max. :5.00
## freetime goout Dalc Walc health
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.00 Min. :1.000
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:3.000
## Median :3.000 Median :3.000 Median :1.000 Median :2.00 Median :4.000
## Mean :3.223 Mean :3.113 Mean :1.474 Mean :2.28 Mean :3.579
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:3.00 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.00 Max. :5.000
## absences G1 G2 G3
## Min. : 0.000 Min. : 3.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 8.00 1st Qu.: 8.25 1st Qu.: 8.00
## Median : 3.000 Median :10.50 Median :11.00 Median :11.00
## Mean : 5.319 Mean :10.86 Mean :10.71 Mean :10.39
## 3rd Qu.: 8.000 3rd Qu.:13.00 3rd Qu.:13.00 3rd Qu.:14.00
## Max. :75.000 Max. :19.00 Max. :19.00 Max. :20.00
## alc_use high_use
## Min. :1.000 Mode :logical
## 1st Qu.:1.000 FALSE:270
## Median :1.500 TRUE :112
## Mean :1.877
## 3rd Qu.:2.500
## Max. :5.000
table(high_use = alc$high_use, age = alc$age)
## age
## high_use 15 16 17 18 19 20 22
## FALSE 63 79 65 55 7 1 0
## TRUE 18 28 35 26 4 0 1
table(high_use = alc$high_use, sex = alc$sex)
## sex
## high_use F M
## FALSE 157 113
## TRUE 41 71
table(high_use = alc$high_use, failures = alc$failures)
## failures
## high_use 0 1 2 3
## FALSE 234 22 5 9
## TRUE 82 16 6 8
table(high_use = alc$high_use, absences = alc$absences)
## absences
## high_use 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
## FALSE 94 2 54 2 36 4 19 6 13 3 12 1 6 0 7 1 2 0 2 0 1 1 0
## TRUE 22 1 13 4 15 0 11 2 8 0 5 1 4 3 4 2 4 1 2 1 1 0 3
## absences
## high_use 23 24 25 26 28 30 54 56 75
## FALSE 1 0 1 1 0 0 0 0 1
## TRUE 0 1 0 0 1 1 1 1 0
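Raw counts can mislead when the groups differ in size; column proportions make the comparisons direct. A sketch with the high_use-by-sex counts hard-coded from the cross tabulation above:

```r
# Column proportions from the high_use x sex cross tabulation above.
tab <- matrix(c(157, 41, 113, 71), nrow = 2,
              dimnames = list(high_use = c("FALSE", "TRUE"), sex = c("F", "M")))
prop.table(tab, margin = 2)  # share of high use within each sex:
                             # about 21% of females vs 39% of males
```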
Graphically
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
malc <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
g1 <- ggplot(data = alc, aes(x = alc_use, fill = sex))
g1 + geom_bar()
alc <- mutate(alc, high_use = alc_use > 2)
g2 <- ggplot (alc, aes(high_use))
g2 + facet_wrap("age") + geom_bar()
g2 + facet_wrap("failures") + geom_bar()
g2 + facet_wrap("absences") + geom_bar()
g2 + facet_wrap("sex") + geom_bar()
library(tidyr)
glimpse(alc)
## Rows: 382
## Columns: 35
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "G...
## $ sex <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, ...
## $ address <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "...
## $ famsize <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", ...
## $ Pstatus <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3,...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2,...
## $ Mjob <chr> "at_home", "at_home", "at_home", "health", "other", "ser...
## $ Fjob <chr> "teacher", "other", "other", "services", "other", "other...
## $ reason <chr> "course", "course", "other", "home", "home", "reputation...
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "...
## $ internet <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "ye...
## $ guardian <chr> "mother", "father", "mother", "mother", "father", "mothe...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1,...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1,...
## $ failures <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,...
## $ schoolsup <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no",...
## $ famsup <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "ye...
## $ paid <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes...
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", ...
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", ...
## $ romantic <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5,...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5,...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5,...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4,...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5,...
## $ absences <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, ...
## $ G1 <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 1...
## $ G2 <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 1...
## $ G3 <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 16, ...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
gather(alc) %>% glimpse
## Rows: 13,370
## Columns: 2
## $ key <chr> "school", "school", "school", "school", "school", "school", "...
## $ value <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "...
gather(alc) %>% ggplot(aes(value))+ facet_wrap("key", scales = "free")+ geom_bar()
Interpretation: the summary of the fitted model shows that failures, absences and sex have a statistically significant relationship with high/low alcohol use (their p-values are all below 0.05), while age does not (its p-value is above 0.05). Interpreting the exponentiated coefficients as odds ratios: the odds ratio of age is 1.12, but its 95% CI (0.91, 1.38) includes 1, so the effect is not significant; the odds ratio of failures is 1.46 (95% CI 1.08, 1.98), meaning students with more past class failures are more likely to be high alcohol users; the odds ratio of absences is 1.07 (95% CI 1.04, 1.11), meaning students with more absences are more likely to be high users; and the odds ratio of sexM is 2.69 (95% CI 1.67, 4.38), meaning males are more likely to be high users than females. These results partly confirm the hypothesis.
malc <- glm(high_use ~ age + failures + absences + sex, data = alc, family = "binomial")
summary(malc)
##
## Call:
## glm(formula = high_use ~ age + failures + absences + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6644 -0.8051 -0.6026 1.0478 2.1115
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.82153 1.72779 -2.212 0.0270 *
## age 0.11375 0.10394 1.094 0.2738
## failures 0.37915 0.15333 2.473 0.0134 *
## absences 0.07059 0.01795 3.932 8.43e-05 ***
## sexM 0.98875 0.24475 4.040 5.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 417.44 on 377 degrees of freedom
## AIC: 427.44
##
## Number of Fisher Scoring iterations: 4
OR <- coef (malc) %>% exp
CI <- confint(malc) %>% exp
## Waiting for profiling to be done...
cbind (OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.02189419 0.0007075463 0.6273627
## age 1.12047076 0.9145032988 1.3757107
## failures 1.46104027 1.0795866408 1.9778080
## absences 1.07314328 1.0382625002 1.1137734
## sexM 2.68786068 1.6739407174 4.3782085
Comments: the results show that the model predicts the target variable reasonably well, with an accuracy of 74.35%. The predicted probabilities of the first ten participants are 0.18, 0.16, 0.50, 0.14, 0.16, 0.44, 0.28, 0.18, 0.28, 0.28. In total 12 + 86 = 98 individuals, about 25.65%, are inaccurately classified. A simple guessing strategy of always predicting high use would be right only about 29% of the time, whereas this model is right 74.35% of the time, so it is better than simple guessing.
malc <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
summary(malc)
##
## Call:
## glm(formula = high_use ~ failures + absences + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6629 -0.8545 -0.5894 1.0339 2.0428
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.95397 0.22819 -8.563 < 2e-16 ***
## failures 0.40462 0.15024 2.693 0.00708 **
## absences 0.07294 0.01796 4.061 4.88e-05 ***
## sexM 0.98848 0.24453 4.042 5.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 418.64 on 378 degrees of freedom
## AIC: 426.64
##
## Number of Fisher Scoring iterations: 4
probabilities <- predict (malc, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
select(alc, failures, absences, sex, high_use, probability, prediction) %>% head(10)
## failures absences sex high_use probability prediction
## 1 0 6 F FALSE 0.1799998 FALSE
## 2 0 4 F FALSE 0.1594640 FALSE
## 3 3 10 F TRUE 0.4973115 FALSE
## 4 0 2 F FALSE 0.1408685 FALSE
## 5 0 4 F FALSE 0.1594640 FALSE
## 6 0 10 M FALSE 0.4412412 FALSE
## 7 0 0 M FALSE 0.2757800 FALSE
## 8 0 6 F FALSE 0.1799998 FALSE
## 9 0 0 M FALSE 0.2757800 FALSE
## 10 0 0 M FALSE 0.2757800 FALSE
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 258 12
## TRUE 86 26
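Accuracy and total error follow directly from the confusion matrix. A sketch with the counts hard-coded from the table above:

```r
# Confusion matrix counts from the table above: rows = high_use, cols = prediction.
cm <- matrix(c(258, 86, 12, 26), nrow = 2,
             dimnames = list(high_use = c("FALSE", "TRUE"),
                             prediction = c("FALSE", "TRUE")))
accuracy <- sum(diag(cm)) / sum(cm)   # (258 + 26) / 382
error    <- 1 - accuracy              # (12 + 86) / 382, matching the loss function
round(c(accuracy = accuracy, error = error), 4)
```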
library(dplyr)
library(ggplot2)
galc <- ggplot (alc, aes(x = age, y = high_use, col = prediction))
galc + geom_point()
galc <- ggplot (alc, aes(x = failures, y = high_use, col = prediction))
galc + geom_point()
galc <- ggplot (alc, aes(x = absences, y = high_use, col = prediction))
galc + geom_point()
galc <- ggplot (alc, aes(x = sex, y = high_use, col = prediction))
galc + geom_point()
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67539267 0.03141361 0.70680628
## TRUE 0.22513089 0.06806283 0.29319372
## Sum 0.90052356 0.09947644 1.00000000
Comments: the prediction error of my model is about the same as that of the model introduced in DataCamp. Although I initially chose age, failures, absences and sex as variables, age has no statistically significant relationship with high/low alcohol use; with the final model (failures, absences and sex), the training error is about 0.26 and the 10-fold cross-validation error is about 0.27, so dropping age does not noticeably worsen the error.
library(boot)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2565445
cv <- cv.glm(data = alc, cost = loss_func, glmfit = malc, K = 10)
cv$delta[1]
## [1] 0.2696335
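Because the folds are assigned at random, `cv$delta[1]` varies a little between runs. A hedged sketch (assuming `alc`, `malc` and `loss_func` exist as above) that averages a few repetitions for a more stable estimate:

```r
library(boot)
set.seed(123)  # for reproducible fold assignment
cv_errors <- replicate(5, cv.glm(data = alc, cost = loss_func,
                                 glmfit = malc, K = 10)$delta[1])
mean(cv_errors)  # average 10-fold CV error over 5 repetitions
```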
(Comments: The Boston data describes housing values in the suburbs of Boston. It has 506 rows and 14 columns: crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat and medv.)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
(Interpretation: The summary shows that the variables are on very different scales: crim ranges from 0.006 to 88.976 (mean 3.61), zn from 0 to 100 (mean 11.36), indus from 0.46 to 27.74 (mean 11.14), chas from 0 to 1 (mean 0.07), nox from 0.385 to 0.871 (mean 0.555), rm from 3.56 to 8.78 (mean 6.28), age from 2.9 to 100 (mean 68.57), dis from 1.13 to 12.13 (mean 3.80), rad from 1 to 24 (mean 9.55), tax from 187 to 711 (mean 408.2), ptratio from 12.60 to 22.00 (mean 18.46), black from 0.32 to 396.90 (mean 356.67), lstat from 1.73 to 37.97 (mean 12.65) and medv from 5 to 50 (mean 22.53). The crime rate in particular spans four orders of magnitude, and large spreads appear in the other variables as well. As for the relationships between the variables, the pairwise correlations range from essentially zero (e.g. chas with rad) up to 0.91 (rad with tax) and down to -0.77 (nox with dis).)
(1)summary
summary (Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
(2)correlation and graph
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
##
## extract
cor_matrix<-cor(Boston) %>% round(digits=2)
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
library(corrplot)
## corrplot 0.84 loaded
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
pairs(Boston, col = "blue", pch = 18, main = "Matrix plot of the variables")
(After scaling, every variable has mean 0 and standard deviation 1, so the Min., 1st Qu., Median, Mean, 3rd Qu. and Max. all change; the largest value in the scaled data is the crim maximum of about 9.92.)
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
class(boston_scaled)
## [1] "matrix" "array"
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
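Since the breaks are the quartiles of crim, the four classes should each contain roughly a quarter of the 506 observations; a quick check (assuming the `crime` column created above):

```r
# Class sizes after quartile binning: roughly 126-127 observations each
table(boston_scaled$crime)
```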
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,] # test set: the rows not used for training
correct_classes <- test$crime # save crime
test <- dplyr::select(test, -crime) # remove crime
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2722772 0.2425743 0.2400990 0.2450495
##
## Group means:
## zn indus chas nox rm age
## low 0.90919019 -0.9191359 -0.129161786 -0.8720604 0.48082849 -0.8636355
## med_low -0.09128296 -0.2975754 0.008892378 -0.5606031 -0.14062830 -0.3497587
## med_high -0.38910901 0.2300571 0.255323541 0.4114345 0.07841763 0.4550310
## high -0.48724019 1.0171737 -0.073485621 1.0863240 -0.40566663 0.7826302
## dis rad tax ptratio black lstat
## low 0.8623228 -0.6843138 -0.7432453 -0.45648302 0.3720747 -0.76881485
## med_low 0.3535233 -0.5494382 -0.4727500 -0.07561246 0.3662437 -0.13704902
## med_high -0.4147473 -0.3922459 -0.2857827 -0.27803256 0.1100240 0.03954604
## high -0.8331652 1.6375616 1.5136504 0.78011702 -0.8528940 0.92231816
## medv
## low 0.569044045
## med_low -0.001458999
## med_high 0.100230773
## high -0.756219401
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11181739 0.54446872 -0.9176924
## indus 0.01572057 -0.32099513 0.2069232
## chas -0.07199338 -0.12937165 0.1028424
## nox 0.38581274 -0.62647705 -1.2292934
## rm -0.14672156 -0.04664981 -0.2189422
## age 0.24023024 -0.41984856 -0.2131727
## dis -0.09390671 -0.13328143 0.1739804
## rad 3.12491797 0.90408744 -0.1759023
## tax 0.05212747 0.03307747 0.6492297
## ptratio 0.12285126 0.06566470 -0.2136849
## black -0.17018681 -0.01493487 0.1410618
## lstat 0.21930630 -0.05183435 0.4108998
## medv 0.21477095 -0.21145902 -0.1601091
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9515 0.0362 0.0122
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, x1 = myscale * heads[,choices[1]], y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
(Comments: The cross tabulation shows the correct versus predicted crime categories for the four classes low, med_low, med_high and high. The diagonal gives the correctly predicted counts: 86 low, 75 med_low, 79 med_high and 126 high, so most of the misclassification happens between adjacent categories, and the high class is predicted almost perfectly.)
(1) Predict the classes with the LDA model
lda.pred <- predict(lda.fit, newdata = test)
(2)cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 86 37 4 0
## med_low 28 75 23 0
## med_high 2 39 79 6
## high 0 0 1 126
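From the cross tabulation the overall accuracy can be read off the diagonal (a small sketch, with the counts typed in by hand from the table above):

```r
# Rows: correct classes, columns: predicted classes (copied from the table)
tab <- matrix(c(86, 28,  2,   0,
                37, 75, 39,   0,
                 4, 23, 79,   1,
                 0,  0,  6, 126), nrow = 4)
sum(diag(tab)) / sum(tab)  # (86 + 75 + 79 + 126) / 506, about 0.72
```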
Interpretation: the summary of the Manhattan distances shows a minimum of 0.27, a median of 12.61, a mean of 13.55 and a maximum of 48.86. The elbow in the total within-cluster sum of squares suggests that the optimal number of clusters is 2, so I ran the algorithm again with centers = 2.
(1)Reload and standardize the Boston dataset
library(MASS)
data("Boston")
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
(2)Calculate the distance between the variables
dist_boston <- dist(boston_scaled, method = 'manhattan')
summary(dist_boston)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
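For comparison, the same distance matrix can be computed with the Euclidean metric; a small self-contained sketch:

```r
# Euclidean distances on the same scaled Boston data, for comparison
# with the Manhattan distances above
library(MASS)
data("Boston")
summary(dist(scale(Boston), method = "euclidean"))
```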
(3)Run k-means algorithm
km <- kmeans(boston_scaled, centers = 3)
(4) Investigate the optimal number of clusters
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss}) # total within-cluster sum of squares for k = 1..10 on the scaled data
(5) Run the algorithm again with the optimal number of clusters and visualize them
library(ggplot2)
qplot(x = 1:k_max, y = twcss, geom = 'line')
km <-kmeans(boston_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)
Bonus: Perform the k-means with >2 clusters
km2 <-kmeans(boston_scaled, centers = 4)
pairs(boston_scaled, col = km2$cluster)
Super bonus
model_predictors <- dplyr::select(train, -crime)
Check the dimensions:
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
Matrix multiplication:
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime) # same plot, points colored by the crime classes of the train set
human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep =",", header = T)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(human)
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
library(corrplot)
library(dplyr)
cor(human) %>% corrplot
Comments: The graph and summary show that there are 8 variables: Edu2.FM, Labo.FM, Edu.Exp, Life.Exp, GNI, Mat.Mor, Ado.Birth and Parli.F. In the correlation plot below, red circles indicate negative correlations and blue circles positive ones; the bigger and darker the circle, the stronger the correlation. For example, Edu.Exp and Life.Exp are highly positively correlated, while Mat.Mor and Life.Exp are strongly negatively correlated. The distributions of the variables vary in shape: some are left-skewed, some right-skewed and some roughly symmetric. The mean of GNI is 17628; the other means can be read from the figures and the summary.
pca_human <- prcomp(human)
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
Interpretation of the unstandardized and standardized results: As the figures show, the two analyses give different results. PCA is based on the variances of the variables, so when the data are not standardized, GNI dominates the first principal component simply because its variance is orders of magnitude larger than that of the other variables. Standardizing the data first puts all variables on the same scale, so each of them can contribute to the components.
human_std <- scale(human)
summary(human_std)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
pca_human_stan <- prcomp(human_std)
biplot(pca_human_stan, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
Interpretation of the plots: the biplot of the unstandardized PCA shows all variable arrows pointing roughly along PC1, so every variable appears to contribute to PC1 (in practice GNI dominates it). In the standardized PCA, Parli.F and Labo.FM point along PC2 and thus contribute to the second component, while the other variables contribute to PC1. The Human Development Index combines several criteria, including life expectancy at birth and expected years of education; Edu.Exp is not strongly related to PC2.
The first principal component is correlated with Edu.Exp, Life.Exp, GNI, Edu2.FM, Mat.Mor and Ado.Birth, while the second is correlated with Parli.F and Labo.FM. The angle between two arrows reflects the correlation between the variables: the angle between Parli.F (percentage of women in parliament) and Labo.FM (female labour force participation) is small, so they are highly positively correlated, whereas the much larger angle between Edu.Exp and Parli.F indicates a much weaker correlation.
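The visual reading above can be backed with numbers: the proportion of variance captured by each principal component is available from `summary()` of the `prcomp` object (a sketch, assuming `pca_human_stan` from above):

```r
s <- summary(pca_human_stan)
# Second row of the importance table = proportion of variance per component
round(100 * s$importance[2, ], 1)
```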
library(FactoMineR)
data(tea) #load tea
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, one_of(keep_columns))
library(tidyr)
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8)) # visualize the tea data
## Warning: attributes are not identical across measure variables;
## they will be dropped
mca_tea <- MCA(tea_time, graph = FALSE) #run MCA
summary(mca_tea) # show result of MCA
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
plot(mca_tea, invisible=c("ind"), habillage = "quali") # visualize MCA
Interpret the result of MCA: Apart from "alone" and "other", the Dim 1 coordinates of the categories are significantly different from zero, since their v.test values fall below -1.96 or above 1.96, so the null hypothesis of a zero coordinate is rejected; the same holds for most categories on Dim 2 and Dim 3. For the categorical variables, the squared correlation ratios (eta2) of how and where are close to one on Dim 1, indicating a strong link between these variables and the first dimension; the corresponding eta2 values show which variables are linked to Dim 2 and Dim 3.
Interpret the biplot: The distance between variable categories is a measure of their similarity. For example, "alone" and "unpackaged" are similar, as are "No.sugar" and "Earl Grey", while "alone" is quite different from "lemon".
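The dimensions discussed above each capture only a modest share of the total variance; the eigenvalues can be extracted directly from the MCA object (assuming `mca_tea` from above):

```r
# Eigenvalue, % of variance and cumulative % for the first dimensions
head(mca_tea$eig, 3)
```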
Interpretation: The RATS data is in wide form and has to be converted to long form. The commands below turn RATS (wide form) into RATSL (long form); long-form data sets are often required for advanced statistical analysis and graphing. The data is then ready for the analysis.
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
library(dplyr)
library(tidyr)
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
RATSL <- RATS %>% gather(key = WD, value = Weight, -ID, -Group)
RATSL <- RATSL %>% mutate(Time = as.integer(substr(WD,3,4)))
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2,...
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, ...
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1...
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, ...
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, ...
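A quick sanity check on the conversion: in long form every rat should contribute exactly one Weight measurement per time point (assuming `RATSL` as above):

```r
# 16 rats measured at each of the 11 time points: 16 * 11 = 176 rows
nrow(RATSL)
length(unique(RATSL$ID)) * length(unique(RATSL$Time))
```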
Interpretation: The graphs show that the weights of the rats in group 1 differ markedly from those in groups 2 and 3: group 1 rats are the lightest of the three groups. Group 2 contains the single heaviest rat, but its other rats are lighter than those in group 3. Group 1 also contains more rats than groups 2 and 3.
library(ggplot2)
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
Interpretation: The weights of the rats in groups 1, 2 and 3 have now been standardized within each time point. In the standardized plot, the heaviest rat in group 2 no longer shows the strong increase visible in the unstandardized plot, while the lighter rats change much less. In other words, standardization affects the heavier rats more than the lighter ones.
RATSL <- RATSL %>%
group_by(Time) %>%
mutate(stdWeight = (Weight - mean(Weight))/sd(Weight) ) %>%
ungroup()
glimpse(RATSL)
## Rows: 176
## Columns: 6
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1,...
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, ...
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "...
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 55...
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, ...
## $ stdWeight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.881900...
ggplot(RATSL, aes(x = Time, y = stdWeight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized Weight")
Interpretation: First, RATSS was created with the mean and standard error of the variable Weight by group and time; the glimpse of RATSS shows 33 rows and 4 columns: Group, Time, mean and se.
The plot of the mean profiles suggests that the mean weights of the three groups differ from each other, since the profiles do not overlap. We could label group 1 the low-weight group, group 2 the moderate-weight group and group 3 the high-weight group.
library(dplyr)
library(tidyr)
library(ggplot2)
n <- RATSL$Time %>% unique() %>% length() # number of days
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise( mean = mean(Weight), se = sd(Weight) / sqrt(n)) %>%
ungroup() # summary data with mean and standard error of Weight by Group and day
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
glimpse(RATSS) # glimpse the data
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ Time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 3...
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375...
## $ se <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.552939, 3...
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) + geom_line() +
scale_linetype_manual(values = c(1,2,3)) + geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) + theme(legend.position = c(0.8,0.8)) +
theme_bw() + theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)") # plot the mean profiles
Interpretation: First, RATSL8S was created; the glimpse shows 16 rows and 3 columns: Group, ID and mean.
The boxplot shows that group 1 has the lowest mean weight of the three groups, group 2 is in between, and group 3 has the highest mean weight. Each group also contains an outlier.
RATSL8S <- RATSL %>%
filter(Time > 0) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup() # create a summary by Group and WD with mean
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
glimpse(RATSL8S) # glimpse RATSL8S
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 2...
ggplot(RATSL8S, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), days 1-64") # draw a boxplot of mean versus Group
Interpretation: Now the outliers (one per group) are removed, which can also be seen in the plot. The other information is similar to the previous plot: Group 3 had the highest mean rat weight, then Group 2, and Group 1 had the lowest.
RATSL8S1 <- RATSL8S %>% filter(ID != 2 & ID != 12 & ID != 13) # create a new data by filtering the outlier
ggplot(RATSL8S1, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), days 1-64") # draw the plot again
Interpretation: The result shows a significant difference between Groups 1, 2, and 3 in the mean weight of the rats, since the p-value is 2.721e-14, which is well below 0.05.
oneway.test(mean ~ Group, data = RATSL8S1, var.equal = TRUE) # run Anova
##
## One-way analysis of means
##
## data: mean and Group
## F = 2577.4, num df = 2, denom df = 10, p-value = 2.721e-14
fit <- lm(mean ~ Group, data = RATSL8S1) # run linear model
anova(fit) # compute the analysis of variance table in order to fitted model with anova
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 175958 87979 2577.4 2.721e-14 ***
## Residuals 10 341 34
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: The BPRS data is in wide form and has to be converted to long form. The commands below convert BPRS (wide form) into BPRSL (long form). Long-form data sets are often required for advanced statistical analysis and graphing, and after the conversion the data is ready for analysis.
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep =" ", header = T)
BPRS$treatment <- factor(BPRS$treatment) #convert BPRS treatment
BPRS$subject <- factor(BPRS$subject) # convert BPRS subject
library(tidyr)
library(dplyr)
BPRSL <- BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject) # convert BPRS to long form
BPRSL <- BPRSL %>% mutate(week = as.integer(substr(weeks,5,5))) # add week to BRPS
glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "we...
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 6...
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
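The same wide-to-long reshape can be sketched on a tiny toy data frame (hypothetical values; the `subject` and `week0`/`week1` columns mimic the BPRS layout):

```r
library(tidyr)
library(dplyr)

# Toy wide-form data: one row per subject, one column per week (hypothetical values)
wide <- data.frame(subject = c(1, 2), week0 = c(42, 58), week1 = c(40, 55))

# gather() stacks the week columns into key/value pairs, one row per measurement
long <- wide %>%
  gather(key = weeks, value = bprs, -subject) %>%
  mutate(week = as.integer(substr(weeks, 5, 5)))  # extract the numeric week

long  # 4 rows: 2 subjects x 2 weeks
```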
Interpretation: First, the dimensions show that there are 360 rows and 5 columns.
The plot shows all 40 men's bprs scores against time, tracking the repeated measurements and labelling each observation with its treatment group. In addition, the two treatment groups appear randomly mixed across the range of bprs values.
dim(BPRSL) # look at the dimension
## [1] 360 5
library(ggplot2)
ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
geom_text(aes(label = treatment)) +
scale_x_continuous(name = "week", breaks = seq(0, 8, 2)) +
scale_y_continuous(name = "bprs") +
theme(legend.position = "top") # draw the plot
Interpretation: In this linear regression model, bprs is the dependent variable, and week and treatment are the explanatory variables. Treatment group 1 is the baseline, so the fitted intercept for treatment group 2 is 46.4539 + 0.5722 = 47.0261. There is no difference between the two groups in terms of treatment, since the p-value of treatment2 is 0.661, which is greater than 0.05.
BPRSL_reg <- lm(bprs ~ week + treatment, data = BPRSL) # create a regression model
summary(BPRSL_reg) # print out the model summary
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
Interpretation: In this model, bprs is the dependent variable, week and treatment are the explanatory variables, and each subject gets a random intercept. The result shows that the AIC of the model is 2748.7, the BIC is 2768.1, the logLik is -1369.4, and df.resid is 355. The scaled residuals range from -3.0481 to 3.4855. For the random effects, the variance of the subject intercepts is 47.41 with a standard deviation of 6.885, which indicates that subjects differ considerably in their baseline bprs. The t value of week in the fixed effects is -10.896, whose large magnitude gives strong evidence that bprs decreases over the weeks. The correlation of the fixed effects between week and treatment2 is 0.000.
library(lme4) # load lme4
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE) # run random intercept model
summary(BPRS_ref) # print out summary
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
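Note that `lmer()` summaries report t values but no p-values. As a rough sketch (a normal approximation, which is reasonable with 360 observations), two-sided p-values can be computed from the t values printed above:

```r
# Normal-approximation p-values from the t values in the lmer summary above
t_week <- -10.896                    # t value of week
t_trt2 <- 0.532                      # t value of treatment2
p_week <- 2 * pnorm(-abs(t_week))    # essentially zero: week clearly matters
p_trt2 <- 2 * pnorm(-abs(t_trt2))    # about 0.59: treatment2 does not
```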
Interpretation: The result shows that the fixed effects are quite similar to those of the random intercept model. The random intercept and slope model offers a better fit for the data, with a smaller AIC (2745.4 vs. 2748.7). The p-value of the likelihood ratio test is 0.026, less than 0.05, which indicates that the improvement is significant; the chi-squared statistic is 7.27 with 2 degrees of freedom.
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE) # create a random intercept and random slope model
summary(BPRS_ref1) # print out the summary
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
anova(BPRS_ref1, BPRS_ref) # run anova
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
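The chi-squared statistic in the ANOVA above is the difference in deviances of the two models, compared against a chi-squared distribution with degrees of freedom equal to the difference in parameter counts (7 - 5 = 2). Recomputing the p-value from the reported statistic:

```r
# Likelihood ratio test by hand, using values from the anova() output above
chisq <- 7.2721                                # reported Chisq (= 2738.7 - 2731.4 up to rounding)
p <- pchisq(chisq, df = 2, lower.tail = FALSE) # upper-tail chi-squared probability
round(p, 5)                                    # 0.02636, matching Pr(>Chisq)
```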
Interpretation: The result shows that the fixed effects for week and the AIC value are quite similar to those of the previous model. The p-value of the likelihood ratio test is 0.075, greater than 0.05, which indicates that adding the week-by-treatment interaction does not significantly improve the fit; the chi-squared statistic is 3.17 with 1 degree of freedom.
The two plots below show the observed bprs profiles and the fitted bprs profiles from the interaction model. The first plot shows the observed data, and the second illustrates how well the interaction model fits it.
BPRS_ref2 <- lmer(bprs ~ + week * treatment + (week | subject ), data = BPRSL, REML = FALSE) # create a random intercept and random slope model with the interaction
summary(BPRS_ref2) # print a summary of the model
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ +week * treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
anova(BPRS_ref2, BPRS_ref1) # run anova
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ +week * treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_ref2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(ggplot2)
p1 <- ggplot(BPRSL, aes(x = week, y = bprs, color = interaction(subject, treatment)))
p2 <- p1 + geom_line() + geom_point()
p3 <- p2 + scale_x_continuous(name = "week")
p4 <- p3 + scale_y_continuous(name = "bprs")
p5 <- p4 + theme_bw() + theme(legend.position = "none")
p6 <- p5 + theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
p7 <- p6 + ggtitle("Observed")
graph1 <- p7 # draw a plot
Fitted <- fitted(BPRS_ref2) # create a vector of the fitted values from the interaction model
BPRSL <- BPRSL %>%
mutate(Fitted) # add the fitted values to BPRSL as a new column
p1 <- ggplot(BPRSL, aes(x = week, y = Fitted, color = interaction(subject, treatment)))
p2 <- p1 + geom_line() + geom_point()
p3 <- p2 + scale_x_continuous(name = "week")
p4 <- p3 + scale_y_continuous(name = "bprs")
p5 <- p4 + theme_bw() + theme(legend.position = "none")
p6 <- p5 + theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
p7 <- p6 + ggtitle("Fitted")
graph2 <- p7
graph1; graph2